Retry **every** database error #698
Conversation
It took me surprisingly long to answer a simple question: what happens with simple configuration errors, like a wrong username, for example? Answer: it immediately logs a warning like this and exits with the same error at a later point:
So if you did something wrong while setting it up, this PR shouldn't delay you from seeing the problem. Why did it take me so long to test this? I wanted to test it within the Docker image, and the behavior totally didn't match my expectations (it logged that it would retry but exited immediately afterwards). What happened? The Docker image has an entrypoint which itself connects to the database, and that entrypoint uses a pinned version of the Icinga DB code. So, long story short, that has to be updated as well, no matter in which way we change the error handling.
I've created the release issue and added a corresponding point to the checklist there so that it won't be forgotten: #707
I've added this to the PR description as that PR becomes obsolete if we do it this way.
Not sure about that one. Yes, the issue mentions a "Lock wait timeout exceeded" error, but only after
I've just created an issue for this: #717.
Please see #717 (comment).
Added commit ef34b7d to account for #686 (comment). PR is now ready for review.
The HA transaction retry handler still does not set a timeout and hangs forever without an error (if the same error is always returned and the debug log isn't enabled). Was that left unchanged deliberately, or was adjusting it missed?
pkg/retry/retry.go (outdated)
```go
defer time.AfterFunc(settings.Timeout, func() {
	cancelTimeWindow()
}).Stop()
```
Suggested change:

```go
defer time.AfterFunc(settings.Timeout, cancelTimeWindow).Stop()
```
Yep, that makes perfect sense.
Done.
I will have a look.
Force-pushed from 8e9fc70 to ff0e537.
I also added a timeout to HA.
There is one more thing I have to fix in HA: we're using a deadline ctx for `realize()`.
Force-pushed from ff0e537 to f8fa60c.
However, keep in mind that this deadline should probably be kept, as it's based on the time when the heartbeat expires (lines 202 to 206 in 2826af4).
Retrying the HA transaction with an outdated heartbeat would make no sense. And if it retries for longer than that, it may delay the forwarding of the heartbeat loss to the rest of the code. And as I'm looking at that
Force-pushed from f8fa60c to 487fa58.
Before, with `Timeout > 0`, we had to check errors for `nil` because we were using a deadline context that might have been cancelled before the retry function completed with an error, but since we now pass `ctx` as-is, this is no longer necessary.
Ensure that updating/inserting the instance row is also completed by the current heartbeat's expiry time in takeover scenarios. Otherwise, there is a risk that the takeover is performed with an already expired heartbeat if the attempt takes longer than the heartbeat's expiry time.
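A minimal sketch of that idea, using only the standard library and hypothetical names (`takeover`, `upsertInstance` are placeholders, not the actual HA code): the instance upsert is bounded by the heartbeat's expiry via a deadline context, so the attempt is abandoned instead of being completed with a stale heartbeat.

```go
package haexample

import (
	"context"
	"time"
)

// upsertInstance is a hypothetical placeholder for the query that
// inserts/updates the instance row during a takeover; it is expected
// to honor ctx cancellation.
func upsertInstance(ctx context.Context) error {
	// ... execute the INSERT ... ON DUPLICATE KEY UPDATE here ...
	return ctx.Err()
}

// takeover bounds the instance upsert by the current heartbeat's expiry
// time, so the attempt is abandoned rather than performed with an
// already expired heartbeat.
func takeover(ctx context.Context, heartbeatExpiry time.Time) error {
	ctx, cancel := context.WithDeadline(ctx, heartbeatExpiry)
	defer cancel()

	return upsertInstance(ctx)
}
```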
Force-pushed from b2fd89f to 8d187be.
```diff
@@ -55,8 +55,8 @@ func (c RetryConnector) Connect(ctx context.Context) (driver.Conn, error) {
 	shouldRetry,
 	backoff.NewExponentialWithJitter(time.Millisecond*128, time.Minute*1),
```
With this `retry.WithBackoff()` nested within the ones in other places that execute SQL queries, and with these backoff settings, some strange effects can be observed: if you stop your database for something like 4m30s, you may see some attempts recovering and others, quite some time after that, failing fatally because they exceeded the 5 minutes. I believe this is due to the maximum backoff of a minute given here, i.e. there may be a minute without an attempt, and after that, both the inner and the outer 5-minute timeout are exceeded.

Also, if we're adding `retry.WithBackoff()`, do we even still need it here? Just for fun, I removed all the retrying from this function (so that it just does `Connect()` and `initConn()` with no loops or anything around it) and it seemed to work more reliably; only the logging became very annoying, with queries and stack traces within a single line.
The following changes address your considerations:

- `HA` now uses a `Timer` instead of a `Ticker`, which is reset properly even if it has expired (and has not yet been drained): 81085c0.
- If the timeout expires during the sleep phase between attempts, one final retry attempt will be made: 3a3baaf.
> only the logging became very annoying with queries and stacktraces within a single line

I have something in my stash which fixes this. `zap` adds the `errorVerbose` field if the error implements `fmt.Formatter`, which is true for all our wrapped errors. A simple solution would be to replace `zap.Error()` calls with a custom implementation that returns a "silent" error, e.g.:
```go
package logging

import (
	"github.com/pkg/errors"
	"go.uber.org/zap"
)

// stackTracer is implemented by errors from github.com/pkg/errors
// that carry a stack trace.
type stackTracer interface {
	StackTrace() errors.StackTrace
}

// errNoStackTrace wraps an error but hides its verbose representation,
// so zap does not add the errorVerbose field.
type errNoStackTrace struct {
	e error
}

func (e errNoStackTrace) Error() string {
	return e.e.Error()
}

// Error is a drop-in replacement for zap.Error() that suppresses
// stack traces of wrapped errors.
func Error(e error) zap.Field {
	if _, ok := e.(stackTracer); ok {
		return zap.Error(errNoStackTrace{e})
	}

	return zap.Error(e)
}
```
However, the PR just keeps the nested retrying for now, thereby avoiding this issue.
Force-pushed from 7545764 to fc00fa9.
Since we are now retrying every database error, we also need to set a retry timeout.
All of our error callbacks are used to log the error and indicate that we are retrying. Previously, in the case of context errors or non-retryable errors, we would have called these too, which would have resulted in misleading log messages.
The retryable function may exit prematurely due to context errors that shouldn't be retried. Before, we checked the returned error for context errors, i.e. used `errors.Is()` to compare it to `Canceled` and `DeadlineExceeded` which also yields `true` for errors that implement `Is()` accordingly. For example, this applies to some non-exported Go `net` errors. Now we explicitly check the context error instead.
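Expressed as a sketch (not the exact `retry.WithBackoff()` code), the distinction is between classifying the attempt's returned error and checking our own context:

```go
package retryexample

import (
	"context"
	"errors"
)

// stopOld classifies by the attempt's returned error. This also matches
// errors that merely implement Is(context.Canceled) or
// Is(context.DeadlineExceeded), e.g. some non-exported net errors, even
// though our own context is still alive.
func stopOld(err error) bool {
	return errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)
}

// stopNew only gives up when our own context has actually been cancelled
// or its deadline has been exceeded.
func stopNew(ctx context.Context) bool {
	return ctx.Err() != nil
}
```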
Logging the `attempt` count is a meaningless metric: it is not logged constantly but only when the retryable error changes, and it lacks context as there is no such thing as a maximum number of attempts.
This change simplifies the use of `attempt` as a number, both for reading in log messages and for use in `if`s. Also, before, with `attempt` starting at `0`, the second attempt would have been taken immediately, as our backoff implementation returns `0` in this case.

Co-Authored-By: Alvar Penning <[email protected]>
`WithBackoff()` will now make one final retry if the timeout expires during the sleep phase between attempts, which can be a long period depending on the attempts made and the maximum sleep time.
Force-pushed from fc00fa9 to 3a3baaf.
With the latest changes, a reconnect close to the timeout boundary looks like this:
No more complaints from my side. Compared to the others, I haven't spent much time looking into the details of this PR so far, so I think it would be good to get another approval on this as well.
2024-04-11T09:43:21.347+0200 INFO database Reconnected to database {"after": "4m50.284240542s", "attempts": 14}
2024-04-11T09:43:25.079+0200 INFO database Reconnected to database {"after": "4m44.25903525s", "attempts": 14}
2024-04-11T09:43:28.895+0200 INFO database Reconnected to database {"after": "2m45.294415s", "attempts": 12}
2024-04-11T09:43:29.917+0200 INFO database Reconnected to database {"after": "4m28.553121625s", "attempts": 15}
2024-04-11T09:43:31.347+0200 INFO database Reconnected to database {"after": "4m54.634997417s", "attempts": 15}
2024-04-11T09:43:33.639+0200 INFO database Reconnected to database {"after": "5m0.007767875s", "attempts": 15}
2024-04-11T09:43:34.152+0200 INFO database Reconnected to database {"after": "5m0.007966958s", "attempts": 15}
2024-04-11T09:43:34.820+0200 INFO database Reconnected to database {"after": "3m53.386925458s", "attempts": 13}
2024-04-11T09:43:34.922+0200 INFO database Reconnected to database {"after": "5m0.006823084s", "attempts": 15}
2024-04-11T09:43:38.775+0200 INFO database Reconnected to database {"after": "5m0.00952125s", "attempts": 15}
2024-04-11T09:43:39.100+0200 INFO history-sync Synced 622 state history items
2024-04-11T09:43:39.100+0200 INFO history-sync Synced 495 notification history items
2024-04-11T09:43:39.272+0200 INFO runtime-updates Upserted 17 ServiceState items
2024-04-11T09:43:39.846+0200 INFO database Reconnected to database {"after": "4m53.632643916s", "attempts": 15}
2024-04-11T09:43:41.586+0200 INFO database Reconnected to database {"after": "3m1.067891625s", "attempts": 12}
2024-04-11T09:43:46.617+0200 INFO database Reconnected to database {"after": "4m20.077357709s", "attempts": 13}
2024-04-11T09:43:51.625+0200 INFO database Reconnected to database {"after": "4m5.054033166s", "attempts": 13}
2024-04-11T09:43:55.058+0200 INFO database Reconnected to database {"after": "4m18.504347416s", "attempts": 14}
During testing, I just encountered a race condition where my Galera cluster was not yet ready, causing the initial schema check to fail.

```
2024-04-11T08:13:40.401Z INFO icingadb Starting Icinga DB daemon (1.1.1)
2024-04-11T08:13:40.401Z INFO icingadb Connecting to database at 'mysql:3306'
2024-04-11T08:13:40.404Z FATAL icingadb Error 1047 (08S01): WSREP has not yet prepared node for application use
can't check database schema version
github.com/icinga/icingadb/pkg/icingadb.(*DB).CheckSchema
	/go/src/github.com/Icinga/icingadb/pkg/icingadb/db.go:115
main.run
	/go/src/github.com/Icinga/icingadb/cmd/icingadb/main.go:74
main.main
	/go/src/github.com/Icinga/icingadb/cmd/icingadb/main.go:37
runtime.main
	/usr/local/go/src/runtime/proc.go:271
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1695
exit status 1
```

This change now also retries the initial cluster check. References #698.
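As a rough sketch of what retrying the initial check could look like, using only the standard library: `checkSchema` stands in for `(*DB).CheckSchema`, and the one-second pause and five-minute limit are illustrative, not taken from the PR.

```go
package dbexample

import (
	"context"
	"time"
)

// checkSchemaWithRetry keeps retrying the schema check until it succeeds,
// the parent context is cancelled, or the overall timeout elapses. This way
// a transient "WSREP has not yet prepared node for application use" error
// no longer terminates the daemon immediately.
func checkSchemaWithRetry(ctx context.Context, checkSchema func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Minute)
	defer cancel()

	for {
		err := checkSchema(ctx)
		if err == nil {
			return nil
		}

		select {
		case <-ctx.Done():
			return err // give up with the last schema check error
		case <-time.After(time.Second):
			// try again
		}
	}
}
```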
This PR introduces several changes to our retry functionality:

- Every database error is now retried [1].
- `RetryableFunc` no longer gets cancelled if it exceeds `Timeout`. Before, if `Timeout > 0`, we ensured to stop retrying after `Timeout` expires by passing a deadline context to `RetryableFunc`, which aborts the function once `Timeout` has lapsed, assuming that `context.Done()` is actually taken into account, which applies to all of our usages. I'm pretty sure we didn't think about functions that run longer than `Timeout` and could therefore be cancelled prematurely. Since we are now retrying every database error with a timeout of 5 minutes, this could happen with queries that wait for locks if a generous lock wait timeout is configured in the database server. Now, `RetryableFunc` is granted its full execution time and will not be cancelled if `Timeout` is exceeded. This means that `WithBackoff` may not stop exactly after `Timeout` expires, or may not retry at all if the first execution of `RetryableFunc` already takes longer than `Timeout`.
- `DELETE` statements from retention are now also retried.
- `HA` now stops retrying after 5 minutes.

[1] Based on a short error description from the vendor, it is difficult to tell for every error whether it can actually be retried without the context of when and how exactly such errors are triggered. There are also database clusters that send their own errors using vendor error codes.
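As a rough sketch of the resulting behavior (illustrative only, not the actual `retry.WithBackoff()` implementation; the names and loop structure are made up): the retryable function always receives the parent `ctx` and its full execution time, the timeout is only consulted between attempts, and if it lapses during the sleep, one final attempt is still made.

```go
package retryexample

import (
	"context"
	"time"
)

// withBackoff is an illustrative reimplementation of the described behavior:
// the retryable function is never cancelled just because the retry timeout
// lapsed, the timeout is only checked between attempts, and a timeout that
// expires during the sleep still allows one final attempt.
func withBackoff(
	ctx context.Context,
	retryable func(context.Context) error,
	backoff func(attempt uint64) time.Duration,
	timeout time.Duration,
) error {
	deadline := time.Now().Add(timeout)

	for attempt := uint64(0); ; attempt++ {
		err := retryable(ctx) // parent ctx passed as-is, full execution time
		if err == nil {
			return nil
		}

		if ctx.Err() != nil || time.Now().After(deadline) {
			return err // parent ctx done or retry timeout exceeded
		}

		select {
		case <-ctx.Done():
			return err
		case <-time.After(backoff(attempt)):
			// If the timeout expired while sleeping, the next iteration still
			// performs one final attempt before the deadline check gives up.
		}
	}
}
```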
fixes #651
fixes #677
fixes #730
closes #679